The TDT-3 Text and Speech Corpus
Abstract
The TDT-3 Text and Speech Corpus expands on previous phases of Topic Detection and Tracking (TDT) data collection by increasing the number of news sources sampled, by including Mandarin Chinese as well as English news data, and by introducing new forms of topic annotation. To satisfy the specific data and annotation requirements of the TDT-3 Evaluation Plan [1], the LDC refined and supplemented the methods used in TDT-2 corpus development [2]. There were significant changes and improvements in the process of selecting and defining target topics, in the quality assurance procedures applied to both data content and annotations, and in the organization of the delivered corpus. In addition, the LDC created or acquired a range of resources to support research in cross-language information retrieval. These included the addition of a Mandarin Chinese component to the TDT-2 Text and Speech Corpus, the collection of a large body of Chinese-English parallel text, and the adaptation of Chinese-to-English and English-to-Chinese glossing lexicons. All the resources developed for participants in the TDT-3 Evaluation are being added to the LDC's catalog of corpora for general availability.
Similar resources
The TDT-2 Text and Speech Corpus
This paper describes the creation and content of the TDT-2 corpus in the context of the TDT-2 research project it supports and in comparison to previous and subsequent efforts.
Improved spoken document retrieval by exploring extra acoustic and linguistic cues
In this paper, we explored the use of various kinds of additional information to improve the performance of spoken document retrieval (SDR). From the speech recognition perspective, we incorporated acoustic stress and word confusion information into the audio indexing. From the linguistic perspective, we applied part-of-speech information in both the audio indexing and the query representation. From t...
The A'lam Corpus: A Standard Named-Entity Corpus for the Persian Language
Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the boundaries of each named entity, recognizing its type, and classifying it into predefined categories. The categories of named...
Many Uses, Many Annotations for Large Speech Corpora: Switchboard and TDT as Case Studies
This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out ...
Transliteration of Proper Names in Cross-Lingual Information Retrieval
We address the problem of transliterating English names using Chinese orthography in support of cross-lingual speech and text processing applications. We demonstrate the application of statistical machine translation techniques to “translate” the phonemic representation of an English name, obtained by using an automatic text-to-speech system, to a sequence of initials and finals, commonly used ...
Publication date: 1999